- Finish simple linear regression
- Analysis of residuals
- How to report your statistical results
- Non-linear regression
- Multiple linear regression
- Git and GitHub
- One factor ANOVA
- Experimental Design (Tuesday)
10/25/2018
To develop a better predictive model than is possible from models based on single independent variables.
To investigate the relative individual effects of each of the multiple independent variables above and beyond the effects of the other variables.
The individual effects of each of the predictor variables on the response variable can be depicted by single partial regression lines.
The slope of any single partial regression line (partial regression slope) thereby represents the rate of change or effect of that specific predictor variable (holding all the other predictor variables constant to their respective mean values) on the response variable.
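A partial regression slope can be visualized by predicting the response across one predictor while holding the other at its mean. A minimal sketch with simulated data (variable names `x1`, `x2` are hypothetical, not from the RNAseq example):

```r
# Simulated data (hypothetical): two correlated predictors
set.seed(42)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(100)

fit <- lm(y ~ x1 + x2)

# Predicted response across x1, with x2 held at its mean value
newdat <- data.frame(x1 = seq(min(x1), max(x1), length.out = 50),
                     x2 = mean(x2))
partial_line <- predict(fit, newdata = newdat)

# The slope of this line equals the partial regression coefficient for x1
coef(fit)["x1"]
```

Because the model is additive, the line's slope is exactly the fitted coefficient for `x1`; the choice of value at which `x2` is held only shifts the line up or down.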
Additive model \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \ldots + \beta_jx_{ij} + \epsilon_i\]
Multiplicative model (with two predictors) \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i1}x_{i2} + \epsilon_i\]
RNAseq_Data <- read.table('RNAseq_lip.tsv', header=T, sep='\t')
y  <- RNAseq_Data$Lipid_Conc
g1 <- RNAseq_Data$Gene01
g2 <- RNAseq_Data$Gene02
g3 <- RNAseq_Data$Gene03
g4 <- RNAseq_Data$Gene04

# Multiplicative model: main effects plus their interaction
Mult_lm <- lm(y ~ g1*g2)
summary(Mult_lm)

# Additive model: main effects only
Add_lm <- lm(y ~ g1+g2)
summary(Add_lm)
RNAseq_Data <- read.table('RNAseq_lip.tsv', header=T, sep='\t')
y  <- RNAseq_Data$Lipid_Conc
g4 <- RNAseq_Data$Gene04

# Linear fit
RNAseq_lm_linear <- lm(y ~ g4)
summary(RNAseq_lm_linear)
influence.measures(RNAseq_lm_linear)

# Quadratic (second-degree polynomial) fit
RNAseq_lm_poly <- lm(y ~ poly(g4, 2))
summary(RNAseq_lm_poly)
influence.measures(RNAseq_lm_poly)

# Overlay the fitted curve on a scatterplot (points ordered by g4)
plot(g4, y)
ord <- order(g4)
lines(g4[ord], fitted(RNAseq_lm_poly)[ord], col="blue")
library(car)
scatterplotMatrix(~var1+var2+var3, diag="boxplot")
RNAseq3_lm <- lm(y ~ g1+g2+g3)
summary(RNAseq3_lm)
plot(RNAseq3_lm)
library(car)
scatterplotMatrix(~g1+g2+g3, diag="boxplot")
scatterplotMatrix(~y+g1+g2+g3, diag="boxplot")
Tolerance (\(1-R^2\)) can be obtained as the reciprocal of the variance inflation factor (VIF):
1/vif(lm(y ~ g1 + g2 + g3))
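The reciprocal relationship can be checked by hand: the tolerance for a predictor is 1 minus the \(R^2\) from regressing that predictor on all the others. A sketch with simulated predictors (hypothetical data, not the RNAseq file):

```r
set.seed(1)
g1 <- rnorm(50)
g2 <- g1 + rnorm(50, sd = 0.5)   # deliberately correlated with g1
g3 <- rnorm(50)

# Tolerance for g1: 1 - R^2 from regressing g1 on the other predictors
tol_g1 <- 1 - summary(lm(g1 ~ g2 + g3))$r.squared
vif_g1 <- 1 / tol_g1             # variance inflation factor for g1
```

Low tolerance (high VIF) flags collinearity among the predictors; a common rule of thumb treats VIF greater than 10 as problematic.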
How to decide the complexity of a polynomial: straight-line regression, quadratic, cubic, …
Which variables to keep/discard when building a multiple regression model?
Selecting from candidate models representing different biological processes.
“models should be pared down until they are minimal and adequate”
Crawley 2007, p. 325
The model should predict well.
It should approximate the true relationship between the variables.
We should be able to evaluate a wider array of candidate models, not only a full model and its "reduced" versions.
NOTE: Reduced vs. full models are referred to as nested models. Non-subset models are called non-nested models.
Don’t confuse with nested experimental designs or sampling designs.
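Nested models can be compared directly with an F-test via anova(). A minimal sketch with simulated data (variable names hypothetical):

```r
set.seed(7)
x1 <- rnorm(60)
x2 <- rnorm(60)
y  <- 1 + 2 * x1 + rnorm(60)   # x2 has no real effect

reduced <- lm(y ~ x1)          # nested within the full model
full    <- lm(y ~ x1 + x2)

# F-test of whether the extra term significantly improves fit
anova(reduced, full)
```

This comparison is only valid for nested models; non-nested models are compared with criteria such as AIC or BIC instead.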
How do we accomplish these goals? To answer this, we need criteria for comparing candidate models:
MSresiduals - the mean amount of variation left unexplained by the model; the lowest value indicates the best fit.
Adjusted \(r^2\) - the proportion of variation in the response variable explained by the model, adjusted for both sample size and the number of terms. Larger values indicate better fit.
Mallows' Cp - an index comparing a specific model to a model that contains all the possible terms. Models with the lowest value and/or values closest to their respective p (the number of model terms, including the y-intercept) indicate the best fit.
Akaike Information Criterion (AIC) - there are several versions of AIC, each of which adds a different constant, designed to penalize for the number of parameters and sample size, to a likelihood function, producing a relative measure of the information content of a model. Smaller values indicate more parsimonious models. As a rule of thumb, if the difference between two AIC values (delta AIC) is greater than 2, the lower-AIC model is a significant improvement in parsimony.
Schwarz or Bayesian Information Criterion (BIC or SIC) - outwardly similar to AIC, but the constant added to the likelihood function penalizes models with more predictor terms more heavily (and thus selects simpler models) than AIC does. For this reason BIC is favored by many researchers.
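These criteria can be computed for fitted models in R; a minimal sketch with simulated data, computing Mallows' Cp by hand from its definition and AIC/BIC with the built-in functions:

```r
set.seed(5)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)    # only x1 has a real effect

full <- lm(y ~ x1 + x2 + x3)
sub  <- lm(y ~ x1)

# Mallows' Cp: SSE_p / MSE_full + 2p - n, with p = number of coefficients;
# the full model's Cp equals its own p by construction
mse_full <- sum(resid(full)^2) / full$df.residual
cp <- function(model) {
  p <- length(coef(model))
  sum(resid(model)^2) / mse_full + 2 * p - n
}
cp(sub)    # close to p = 2 if the subset model is adequate
cp(full)   # equals p = 4

# AIC and BIC: smaller is better; a delta AIC > 2 is a
# meaningful difference by the usual rule of thumb
AIC(sub); AIC(full)
BIC(sub); BIC(full)
delta_AIC <- AIC(full) - AIC(sub)
```

Note that with n = 60, log(n) > 2, so BIC penalizes the two extra terms in the full model more heavily than AIC does.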
The leaps package in R performs a smart search among a potentially huge number of candidate models.
For prediction: all models with Cp < p predict about equally well. Don't get carried away hunting for a single "best" model.
For explanation: if numerous equally well-fitting models fit the data, it is difficult to deduce which predictor "explains" the response.
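A sketch of an all-subsets search with leaps::regsubsets() (requires the leaps package; data simulated, variable names hypothetical):

```r
library(leaps)

set.seed(9)
dat <- data.frame(g1 = rnorm(50), g2 = rnorm(50), g3 = rnorm(50))
dat$y <- 1 + 2 * dat$g1 + rnorm(50)

# Exhaustive search over all subsets of the three predictors
subs <- regsubsets(y ~ g1 + g2 + g3, data = dat)
summary(subs)$cp    # Mallows' Cp for the best model of each size
```

Comparing each size's Cp against its p shows which subset sizes predict about equally well, as noted above.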
General caveat: “regression is not causation”. Experiment needed to get to causal explanation.
Percent time that male mice experiencing discomfort spent “stretching”.
Data are from an experiment in which mice experiencing mild discomfort (result of injection of 0.9% acetic acid into the abdomen) were kept in:
The results suggest that mice stretch the most when a companion mouse is also experiencing mild discomfort. Mice experiencing pain appear to “empathize” with co-housed mice also in pain.
From Langford, D. J., et al. 2006. Science 312: 1967-1970.
In words: stretching = intercept + treatment
The model statement includes a response variable, a constant, and an explanatory variable.
The only difference with regression is that here the explanatory variable is categorical.
RNAseq_Data <- read.table('RNAseq_lip.tsv', header=T, sep='\t')
g1 <- RNAseq_Data$Gene01
Pop <- RNAseq_Data$Population
boxplot(g1~Pop, col=c("blue","green"))
#Or, to plot all points:
stripchart(g1~Pop, vertical=T, pch=19, col=c("blue","green"),
at=c(1.25,1.75), method="jitter", jitter=0.05)
Pop_Anova <- aov(g1 ~ Pop)
summary(Pop_Anova)
Factor is sex (Male vs. Female)
Factor is fish tank (10 tanks in an experiment)
Factor is family (measure multiple sibs per family)
Factor is temperature (10 arbitrary temperatures over the natural range)
lm assumes that all effects are fixed. For models with random effects, use lme instead (part of the nlme package).
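A sketch of fitting a random tank effect with nlme::lme(), using simulated data (tank is the random factor, as in the fish-tank example above; all names and values are hypothetical):

```r
library(nlme)

# Simulated data: 10 tanks, 6 fish per tank
set.seed(11)
tank <- factor(rep(1:10, each = 6))
tank_effect <- rnorm(10, sd = 2)[tank]   # shared deviation per tank
growth <- 20 + tank_effect + rnorm(60)
dat <- data.frame(growth, tank)

# Random intercept for tank: tank-to-tank variation is treated as
# a random effect rather than 10 separate fixed parameters
fit <- lme(fixed = growth ~ 1, random = ~ 1 | tank, data = dat)
summary(fit)
VarCorr(fit)   # variance attributed to tanks vs. residual
```

Treating tank as random lets the conclusions generalize to the population of tanks, rather than applying only to the 10 tanks actually used.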